AD-A064  727 


UNCLASSIFIED 


AIR  FORCE  INST  OF  TECH  WRIGHT-PATTERSON  AFB  OHIO  SCH— ETC  F/G  17/2 
OBJECTIVE  MEASURE  OF  SPEECH  INTELLIGIBILITY  USING  LINEAR  PREDIC--ETC <U> 
DEC  78  D M OTTINGER 

AFIT/GE/EE/78-35  NL 




OBJECTIVE  MEASURE  OF  SPEECH 
INTELLIGIBILITY  USING 
LINEAR  PREDICTIVE  CODING 

THESIS 


AFIT/GE/EE/78-35          Donald M. Ottinger

Captain  USAF 




Approved  for  public  release;  distribution  unlimited. 



AFIT/GE/EE/78-35 


OBJECTIVE MEASURE OF SPEECH
INTELLIGIBILITY USING
LINEAR PREDICTIVE CODING

THESIS


Presented  to  the  Faculty  of  the  School  of  Engineering 
of  the  Air  Force  Institute  of  Technology 
Air  Training  Command 
in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of 

Master of Science

by

Donald M. Ottinger, Jr., B.S.

Captain USAF

Graduate Electrical Engineering

December 1978


Approved  for  public  release;  distribution  unlimited 




Preface 

The ability to objectively measure speech intelligibility
has long been a goal of the communications-engineering community.
A few automated techniques have been developed in past years,
but to date no technique has fulfilled all the requirements
desired of an automated system. The subjective scoring of speech
intelligibility by trained listeners remains the most
reliable, though perhaps the most expensive, means of measuring
intelligibility.

Linear predictive coding has appeared on the horizon of
communications theory of late, and in preliminary systems has
proven quite effective in producing synthetic speech. The question
arises: if linear predictive coding can be used to produce high
quality synthetic speech, why can't it be used to measure
the quality of human speech? This study addresses this question
by developing objective measures of speech intelligibility
based on linear predictive coding and measuring their
effectiveness.

I am deeply indebted to Mr. William Hall and Mr. Dave McGrew
for their invaluable help in processing the analog speech data
and for the use of the computer resources of the Analog/Hybrid
System Branch of the ASD Computer Center. I wish to thank
Major Joseph Carl, my advisor, for his guidance, assistance, and
encouragement during this study, and Mr. Richard McKinley of
the Aerospace Medical Research Laboratory for the use of the
laboratory's audio test equipment.

Donald M. Ottinger, Jr.

Abstract

Four distance measures of speech intelligibility based
on linear predictive coding (LPC) are developed and evaluated.
The data base used for evaluating the measures consisted of
lists of 58 words from Diagnostic Rhyme Test IV. The lists
were transmitted over a spread spectrum radio communications
channel and subjected to 7 different levels of non-white,
non-Gaussian jamming noise. The lists were all scored subjectively
for intelligibility by a trained listener panel. The subjective
scores were used to judge the effectiveness of the four
distance measures.

The Articulation Index was also calculated for each of the
word lists and compared to the LPC measures as to effectiveness
and efficiency in measuring speech intelligibility. The
Articulation Index was significantly more effective than the LPC
measures. The best LPC measure provided 42% correlation with
the subjective scores. The Articulation Index provided 69%
correlation. The overhead associated with data tape alignment
and parameter computation makes LPC measures extremely
inefficient as compared to the Articulation Index.

OBJECTIVE MEASURE OF SPEECH
INTELLIGIBILITY USING
LINEAR PREDICTIVE CODING

I. Introduction

A continuing need exists within the military to measure
the intelligibility of speech produced on communications
systems. While methods exist for measuring system parameters
such as signal-to-noise ratio and idle channel noise, very
few tests are available for measuring the actual intelligibility
of the speech produced at the receiver of the communication
channel. Examples of situations in which a measure of speech
intelligibility is needed include the comparative testing of
similar voice communications equipment, on-line evaluation
of voice channel quality, and measurement of the effectiveness
of spectrum jamming in destroying communications capabilities.

The purpose of this thesis is to explore a relatively new
technique used for speech analysis called linear predictive
coding (LPC) to determine if LPC can form the basis for a
measure of speech intelligibility.

Present  Measurement  Systems 

The oldest and most reliable test of speech intelligibility
is subjective scoring. Subjective scoring involves trained
speakers reading a list of words over a communications channel
while a panel of trained listeners subjectively scores the
intelligibility of the received speech. The method is
extremely reliable because actual human listeners are involved
and no equipment is required to attempt to model the human
hearing process. However, the use of human listeners also
accounts for the numerous disadvantages of subjective scoring.
If two communications systems are to be comparatively tested,
the same group of listeners must be used to prevent distortion
of the intelligibility scores due to a difference between the
hearing abilities of two groups. If a significant number of
intelligibility tests are required or the number of systems to
be tested is quite large, considerable time must be spent in
the testing process. If the listener group is large,
considerable man-hours and, therefore, expense will be expended
in testing the systems. For tests evaluating the quality of
speech over on-line communication channels or measuring the
effect of jamming on disrupting communications, the use of a
controlled listener group is impractical if not impossible.

To decrease the man-hours, expense, and impracticality of
subjective intelligibility tests, an automated system is needed
that will accurately measure speech intelligibility without the
use of listener groups. The most important quality of any
automated system developed is the ability to produce the same
results that a listener panel would have produced. To date,
the only automated technique in widespread use is the
Articulation Index.

The Articulation Index (AI) is an automated speech
intelligibility measure that was first described by French and
Steinberg in 1947 (Ref 1:10). The Articulation Index is
computed by measuring the signal-to-noise power ratio (SNR) in
twenty separate audio frequency bands. The SNR value for each
frequency band is then weighted according to its contribution to
the intelligibility of speech. The American National Standards
Institute has established weights to be applied, dependent upon
the communications environment and the type of distortion
present in the communications channel (Ref 2). The sum of the
weighted SNRs is scaled to produce an intelligibility score
with a range from zero to one. An Articulation Index of one
indicates that the speech is perfectly intelligible, while a
value of zero indicates a total lack of intelligibility.
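
The weighting and scaling just described can be illustrated with a
short sketch. It is only a sketch under stated assumptions: the equal
band weights and the 0 to 30 dB useful SNR range are placeholders
chosen for illustration, not the weights established in Ref 2, and
the function name articulation_index is hypothetical.

```python
def articulation_index(band_snr_db, weights=None,
                       snr_floor_db=0.0, snr_ceiling_db=30.0):
    """Scale weighted per-band SNR values (in dB) to a score from 0 to 1.

    Assumptions (not from the thesis): equal weights when none are
    supplied, and a useful SNR range of 0 to 30 dB in every band.
    """
    n_bands = len(band_snr_db)
    if weights is None:
        weights = [1.0 / n_bands] * n_bands   # placeholder: equal importance
    if len(weights) != n_bands:
        raise ValueError("one weight per band is required")
    span = snr_ceiling_db - snr_floor_db
    total = 0.0
    for snr, weight in zip(band_snr_db, weights):
        # Clip each band SNR to the useful range, then map it to 0..1.
        clipped = min(max(snr, snr_floor_db), snr_ceiling_db)
        total += weight * (clipped - snr_floor_db) / span
    # Normalize so the result stays in 0..1 even if the weights do not sum to 1.
    return total / sum(weights)
```

With twenty bands all at or above the ceiling the sketch returns one,
and with all bands at or below the floor it returns zero, matching the
interpretation of the two end points given above.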

Hardware is presently available that can calculate the
Articulation Index directly. One system which has been used
extensively in military applications is the Voice Interference
Analysis System (VIAS) (Ref 1:11). The system measures the
SNR in fourteen separate frequency bands, as opposed to the
twenty bands specified by French and Steinberg, to calculate the
Articulation Index. Reasonably accurate results have been
achieved using the equipment as long as the interfering noise
present was white and Gaussian, and any other distortion
present in the communication channel was known a priori.

The VIAS appears to provide a means of evaluating
speech intelligibility when testing communications equipment
performance in the presence of known channel distortions.
However, the system is not suited for on-line channel
measurements or for studying the effects of real-time jamming of
communications where the types of distortion present are not
known beforehand. The need, therefore, is for an automated
system that can accurately predict speech intelligibility
without prior knowledge of channel distortion types.

Purpose 

The purpose of this thesis is to evaluate the use of a
mathematical technique called linear predictive coding (LPC)
as a basis for developing an objective measure of speech
intelligibility. Linear predictive coding is not a new
technique and can be traced back to the works of Gauss in 1795
(Ref 11:10). However, the use of linear prediction in
communications theory first appeared only in 1949 in the works of
Norbert Wiener (Ref 12). More recently, Saito and Itakura
began applying linear prediction to the formulation of a human
vocal tract model used to synthesize speech. The use of linear
prediction for the synthesis of speech suggests that the
technique might be successfully applied to the analysis of the
intelligibility of speech. A study done by Hartman (Ref 8)
has shown that linear prediction can, in fact, form the basis
for an accurate measure of intelligibility when the distortion
present is additive white Gaussian noise. In a similar study
done by the Georgia Institute of Technology (Ref 4), several
intelligibility measures based on LPC were tested. The tests
were made on a communications system subjected to various types
of distortion that could be expected to occur in communication
channels and with digital voice equipment. Of all the
objective measures tested, the LPC based measures provided the
best results. However, the data base used was severely limited
and, therefore, resulted in large estimated standard deviations
for the measures.

This thesis will evaluate the LPC based objective measures
developed by Hartman and the Georgia Institute of Technology
against the data base created by subjecting a spread spectrum
communications system to non-white jamming noise. The data
base was created by J.E. Bauer (Ref 5) and consists of
monosyllabic words selected from the Harvard Diagnostic Rhyme
Test. The data base has been subjectively scored for
intelligibility as well as being scored by use of the Articulation
Index. The evaluation of the LPC measures will be based on
their correlation with the subjective scores and their relative
advantage or disadvantage over the Articulation Index.

II. Data Base

The data base used in this study was created by J.E. Bauer
(Ref 5) and modified to the form used in this test by Wayne R.
Beeson (Ref 6). The core of the data base consists of
fifty-eight rhyming word pairs from the Diagnostic Rhyme Test Number
IV (DRT-IV). DRT-IV was used since the list is phonetically
balanced and tests for six specific speech attributes. The
speech attributes are voicing, nasality, sustention,
sibilation, graveness, and compactness (Ref 6:9). Table I shows
the word pairs used in the data base and the specific attribute
associated with each pair.

TABLE I

Diagnostic Rhyme Test

PEST  - TEST  - (filler)      - FAN   - PAN
VAULT - FAULT - (voicing)     - CHOCK - JOCK
DUES  - NEWS  - (nasality)    - NOTE  - DOTE
VEE   - BEE   - (sustention)  - TICK  - THICK
THANK - SANK  - (sibilation)  - CARE  - CHAIR
ROD   - WAD   - (graveness)   - DONG  - BONG
SO    - SHOW  - (compactness) - YOU   - RUE
LID   - RID   - (filler)      - REEK  - LEAK
DENSE - TENSE - (voicing)     - GAFF  - CALF
BOSS  - MOSS  - (nasality)    - BOMB  - MOM
FOO   - POOH  - (sustention)  - DOUGH - THOUGH
ZEE   - THEE  - (sibilation)  - GILT  - JILT
FAD   - THAD  - (graveness)   - PENT  - TENT
HOP   - FOP   - (compactness) - YAWL  - WALL
ROW   - LOW   - (filler)      - LOOT  - ROOT
GIN   - CHIN  - (voicing)     - VEAL  - FEEL
BEND  - MEND  - (nasality)    - NAB   - DAB
CHAW  - SHAW  - (sustention)  - BON   - VON
JUICE - GOOSE - (sibilation)  - SOLE  - THOLE
PEAK  - TEAK  - (graveness)   - THIN  - FIN
BAT   - GAT   - (compactness) - KEG   - PEG
ROCK  - LOCK  - (filler)      - LONG  - WRONG
GOAT  - COAT  - (voicing)     - TUNE  - DUNE
MIT   - BIT   - (nasality)    - MEAT  - BEAT
THEN  - DEN   - (sustention)  - SHAD  - CHAD
GAUZE - JAWS  - (sibilation)  - GOT   - JOT
NOON  - MOON  - (graveness)   - DOLE  - BOWL
KEY   - TEA   - (compactness) - DILL  - GILL
RAMP  - LAMP  - (filler)      - LEND  - REND

Four master lists of fifty-eight words each were created
by randomly selecting one word from each rhyming pair of
DRT-IV. Two speakers were used as test subjects, one a male
subject with a southern accent (Arkansas) and the other a male
subject with no noticeable regional accent (Minnesota). Each
speaker recorded two of the four fifty-eight-word master lists.
The lists were recorded on stereo audio tape with one channel
used for the word list and the other channel for timing marks
between words. The timing marks consisted of a one kilohertz
(kHz) sine wave, one-half second long, which was used both
to cue the speaker to say a word and to provide a marker,
separate from the actual data channel, to identify the interval of
tape in which a word was recorded. The timing marks were spaced
seven seconds apart, and Beeson found that the reaction time of
the speakers was such that the word was spoken (and, therefore,
recorded) within the first two and one-half seconds after the
timing mark. The master tapes represented the baseline speech
signal from which all intelligibility distances were measured.

The baseline tapes were played through a spread spectrum
communications system with seven different levels of jamming
noise added to the signal. The recordings of the receiver
output were labeled as to the signal-to-jamming ratio
present in the system. The output tapes were labeled 1 through
7, with 1 signifying the lowest jamming level and 2 through 7
signifying an increasing level of jamming (Ref 5).

Since the data base was entirely in analog form, the data
had to be converted to a digital format to allow digital
computer processing. The analog to digital conversion was done
by the Analog/Hybrid Branch of the Aeronautical Systems
Division (ASD) Computer Center, Wright-Patterson Air Force
Base, Ohio.

Analog  Processing 

A Comcor CI5000/6 analog computer was used to pre-process
and sample the data. Figure 1 is a block diagram of the analog
data processing. To ensure that the analog data effectively
used the full range of the analog-to-digital converter, the
speech signals were amplified so that the voltage swung
between -75 and +75 volts. The amplifiers were followed
by a 4-pole Chebyshev low-pass filter with a cutoff frequency
of 4 kHz. The output of the low-pass filter was then fed into
the sampling circuit.

The 1 kHz timing marks were used to activate the sampling
circuit. As the speech waveform was being amplified and
filtered, the timing marks were being amplified and detected.
To detect the 1 kHz sine wave and activate the sampling
circuit, a network consisting of a comparator and delay flip-flop
was used. After being amplified to a peak voltage of
approximately 125 volts, the sine wave was applied to a comparator.
The other input of the comparator was tied to a 100 volt DC
level. Whenever the input tone was greater than 100 volts,
the comparator output was a logical 1 (+5 volts); any other
time the output was a logical 0 (0 volts). As long as the
tone was present, the output of the comparator was a pulse
train with the same period as the sine wave input.

The output of the comparator was used as a clock signal
to a delay flip-flop with the interval timer set at 2
milliseconds. The delay flip-flop is trailing edge triggered, so
that on the trailing edge of a clock pulse, the output of the
delay flip-flop is a logical 1 for a time interval equal to
the interval timer setting. When the 1 kHz sine wave was
present, the clock input to the delay flip-flop was a pulse
train with a 1 millisecond period, half the amount of time
set on the interval timer. Thus, as long as the sine wave was
present at the input to the circuit, the output of the delay
flip-flop was a logical 1. Upon the occurrence of the last
cycle of the sine wave, the pulse train input to the delay
flip-flop would end and the output of the delay flip-flop
would drop to logical 0 two milliseconds after the
trailing edge of the last pulse. The transition from
logical 1 to 0 of the delay flip-flop was used to activate the
data sampler.
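
The behavior of the comparator and delay flip-flop can be checked
with a small time-step simulation. The sketch below is illustrative
only: the function name, the 1 MHz simulation step, and the 20-cycle
tone burst (much shorter than the actual one-half second mark) are
assumptions made to keep the example small. The 125 volt peak, the
100 volt reference, and the 2 millisecond timer are the values given
above.

```python
import math


def tone_end_triggers(tone_cycles=20, tone_hz=1000.0, peak_v=125.0,
                      threshold_v=100.0, hold_s=0.002,
                      sim_rate_hz=1_000_000):
    """Return the times (s) at which the delay flip-flop output falls 1 -> 0."""
    dt = 1.0 / sim_rate_hz
    tone_len = tone_cycles / tone_hz
    total_steps = int((tone_len + 2.0 * hold_s) * sim_rate_hz)
    prev_cmp = 0          # previous comparator output
    prev_out = 0          # previous delay flip-flop output
    high_until = -1.0     # flip-flop output stays at logical 1 until this time
    triggers = []
    for k in range(total_steps):
        t = k * dt
        v = peak_v * math.sin(2.0 * math.pi * tone_hz * t) if t < tone_len else 0.0
        cmp_out = 1 if v > threshold_v else 0
        if prev_cmp == 1 and cmp_out == 0:
            # Trailing edge of a comparator pulse restarts the interval timer.
            high_until = t + hold_s
        out = 1 if t < high_until else 0
        if prev_out == 1 and out == 0:
            triggers.append(t)        # this transition would start the sampler
        prev_cmp, prev_out = cmp_out, out
    return triggers
```

With the 2 millisecond setting the sketch produces a single trigger,
two milliseconds after the last comparator pulse. If the timer were
set shorter than the 1 millisecond tone period (for example 0.5
millisecond), the output would fall once per cycle and the sampler
would be started repeatedly, which is why the timer must exceed the
period of the pulse train.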

The data was sampled at 8 kHz (the Nyquist rate for the 4 kHz
filter cutoff) and quantized to 12 bits by the analog to digital
converter. The sampler took 20,480 samples each time it was
activated. The 20,480 samples represent a time interval of
approximately 2-1/2 seconds (2.56 seconds at 8 kHz) after the
timing mark, the time interval in which Beeson observed that
each word was recorded. The samples were converted to actual
voltage values and stored on digital magnetic tape using a
Xerox Sigma 7 digital computer integrated with the Comcor
CI5000/6. In all, eleven word lists were sampled (four
baseline lists and seven noise corrupted lists) and stored on
magnetic tape.

All  of  the  words  in  the  data  base  are  monosyllabic  so 
that  the  speech  signal  lasts  for  a time  interval  somewhat  less 
than  2-1/2  seconds.  The  problem  now  was  to  detect  which  of 
the  20,480  sample  values  taken  after  each  timing  mark  actually 
represented  the  speech  signal. 


Digital  Processing 

Since the baseline tapes were recorded in an almost noise
free environment, the data words could be detected in the
stream of data samples by using average energy criteria to
establish thresholds. The data stream (20,480 samples) for
each word was divided into 160 128-point windows and the average
squared sample value for each window calculated. The noise
that is present in the sample values is due primarily to
analog tape hiss, receiver noise, and quantization. This noise
was assumed to be additive white Gaussian noise with a
one-sided spectral height of N_0. By inspecting the
average squared sample values of each window of the baseline
words, it was evident that an upper threshold could be set for
N_0. Any time the average squared sample value was greater than
the threshold, the presence of signal energy due to the spoken
word was indicated. With the assumption that the noise is
Gaussian with a flat spectrum, the signal detection scheme is
optimum. The windows at which the word started and ended were
stored on a disk file as well as the number of windows and
samples each word contained.
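
The windowed energy detection described above can be sketched as
follows. The function names and the threshold value passed in are
hypothetical; the thesis sets the threshold by inspection of the
baseline data, and no numerical value is given. The 128-point window
and the 20,480-sample data stream are taken from the text.

```python
def window_energies(samples, window_len=128):
    """Average squared sample value of each consecutive 128-point window."""
    n_windows = len(samples) // window_len
    return [
        sum(x * x for x in samples[w * window_len:(w + 1) * window_len]) / window_len
        for w in range(n_windows)
    ]


def find_word(samples, threshold, window_len=128):
    """Return (first_window, last_window) whose average energy exceeds the
    threshold, or None if no window does."""
    energies = window_energies(samples, window_len)
    active = [w for w, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

For a 20,480-sample stream the first function returns 160 values, one
per window, and the second returns the start and end windows that
would be stored for each baseline word.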

Since the analog data tapes were re-synchronized every 7
seconds and only the first 2-1/2 seconds of the 7 second
interval was actually used, it was assumed that the location
of the noise corrupted data words would be at the same relative
position within their data stream as the baseline words were in
their respective data streams. The optimum receiver for a
channel corrupted by additive white Gaussian noise is a
correlation receiver. Therefore, to verify that the lists were
indeed synchronized, a cross-correlation between the detected
samples of the baseline word and the data stream containing the
same word plus jamming noise was made. It was hoped that a sharp
peak indicating maximum correlation, and therefore word
alignment, would occur.

Unfortunately, the cross-correlation showed the two
sequences to be uncorrelated, and no information on tape
alignment could be gained. Figure 2 shows the cross-correlation
of a baseline word with the receiver output signal at
the lowest jamming level. Two factors may explain the absence
of correlation between the baseline word and the noise corrupted
word. First, the jamming noise, though intended to be
additive, could have also had a non-linear effect on the signal.
The correlation receiver can no longer be expected to work when
the noise component is not additive. Second, spread spectrum
communications involve many non-linear processes as well as the
spreading and despreading of a signal over a wide bandwidth.
Either the non-linear processes or the spreading/despreading
could change the spectrum of the speech enough to cause zero
correlation between the baseline signal and the received signal.
The possibility of the occurrence of the last factor is
strengthened by the results of computing the Articulation Index.
The results are discussed in Chapter VI.
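
The cross-correlation alignment check described above can be
sketched as follows. This is a minimal illustration, not the program
used in the study: the normalization by segment energy and the
function names are assumptions, and the sketch simply reports the lag
at which a reference word best matches a longer data stream.

```python
import math


def normalized_cross_correlation(reference, stream):
    """Correlation coefficient of `reference` against every lag of `stream`."""
    m = len(reference)
    ref_norm = math.sqrt(sum(x * x for x in reference))
    scores = []
    for lag in range(len(stream) - m + 1):
        segment = stream[lag:lag + m]
        seg_norm = math.sqrt(sum(x * x for x in segment))
        dot = sum(a * b for a, b in zip(reference, segment))
        # Guard against all-zero segments, which carry no information.
        scores.append(dot / (ref_norm * seg_norm) if ref_norm and seg_norm else 0.0)
    return scores


def best_lag(reference, stream):
    """Return (lag, score) of the largest normalized correlation."""
    scores = normalized_cross_correlation(reference, stream)
    lag = max(range(len(scores)), key=scores.__getitem__)
    return lag, scores[lag]
```

When the stream contains a linearly scaled copy of the reference plus
a small additive disturbance, the score at the true lag approaches
one. A flat, low score at every lag corresponds to the situation
reported above.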

The synchronization provided by the timing marks is
representative of the data gathering techniques available in field
organizations. Thus, for purposes of this study, the tapes are
assumed aligned within a close enough tolerance to objectively
evaluate the use of LPC based measures. Whether the failure of
the LPC based measures is associated with the inability to
accurately measure intelligibility or with inadequate tape alignment
is immaterial as far as this study is concerned. The main
thrust of this study is to evaluate the effectiveness of the LPC
measures in an environment representative of the environment
in which the Articulation Index is presently used.

Fig 2. Cross-correlation of Baseline Word with Noise Corrupted
Versions

III. Linear Predictive Coding

Linear predictive coding (LPC) has found widespread use
in communications theory over the past few years. Specific
areas of interest have included voice encoding, speaker
identification, word recognition, and spectrum approximation.
This chapter presents the basic theory behind linear prediction
of speech and one solution algorithm for formulation of the
linear prediction speech analysis model.

Linear  Prediction  of  Speech 

The linear prediction of speech is based on the idea that
at a particular instant in time, a sample of a speech signal,
S(nT), can be approximated by a weighted sum of the preceding
P samples of speech, where P is an integer. This idea can be
expressed mathematically as

$$ S(nT) = \sum_{i=1}^{P} a_i S(nT - iT) \tag{3.1} $$

To simplify notation, Equation 3.1 is most often written in
the form shown below

$$ S(n) = \sum_{i=1}^{P} a_i S(n-i) \tag{3.2} $$

where it is assumed that S(m) represents the mth sample value
of a speech signal sampled every T seconds. Equation 3.2
represents an approximation to the speech signal and, thus, is
not exact. The error between the exact speech sample during
the nth sample interval and its approximation can be defined
by

$$ e(n) = S(n) - \sum_{i=1}^{P} a_i S(n-i) \tag{3.3} $$

The goal is to find the weights (predictor coefficients) that
will minimize the error in some sense over some specified time
interval.

A common minimization technique is to minimize the total
squared error over a defined interval. By defining the total
squared error as E, the goal is to minimize the expression

$$ E = \sum_{n} \left[ S(n) - \sum_{i=1}^{P} a_i S(n-i) \right]^2 \tag{3.4} $$

where the limits on n define the interval over which the error
is to be minimized and are deliberately left undefined for now.
Equation 3.4 can be minimized by taking the partial derivative
of E with respect to each predictor coefficient, $a_i$, and
setting the result equal to zero. The resulting equation is
shown below.

$$ \sum_{n} 2 \left[ S(n) - \sum_{k=1}^{P} a_k S(n-k) \right] \left[ -S(n-i) \right] = 0 \tag{3.5} $$

where

i = 1, 2, ..., P

By rearranging terms and changing the order of summation,
Equation 3.5 can be rewritten as:

$$ \sum_{k=1}^{P} a_k \sum_{n} S(n-k) S(n-i) = \sum_{n} S(n) S(n-i) \tag{3.6} $$

At this point the limits on n must be defined in order to
solve Equation 3.6.

The limits on n are specified by the choice of solution
technique for Equation 3.6. Two common solution techniques
are the Covariance and Autocorrelation Methods. The Covariance
Method defines the minimization of E for an interval of n = 0,
1, ..., N-1 consecutive samples. The Autocorrelation Method
defines the minimization of E for an interval $-\infty < n < +\infty$,
but defines the speech signal as

$$ S(n) = \begin{cases} S(n), & n = 0, 1, \ldots, N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3.7} $$

For this study, the Autocorrelation Method was chosen since it
requires fewer calculations in the solution and ensures that the
speech analysis model constructed is stable. A detailed
analysis of both solution methods is contained in Reference 11.

Having specified the use of the Autocorrelation Method for
solution, Equation 3.6 can be rewritten as

$$ \sum_{k=1}^{P} a_k \sum_{n=-\infty}^{+\infty} S(n-k) S(n-i) = \sum_{n=-\infty}^{+\infty} S(n) S(n-i) \tag{3.8} $$

With a change of index, j = n-i, Equation 3.8 can be written as

$$ \sum_{k=1}^{P} a_k \sum_{j=-\infty}^{+\infty} S(j+i-k) S(j) = \sum_{j=-\infty}^{+\infty} S(j+i) S(j) \tag{3.9} $$

The estimate of the autocorrelation function of the signal
S(n) is

$$ R(i) = \sum_{n=-\infty}^{+\infty} S(n) S(n+i) \tag{3.10} $$

where

R(i) = R(-i)

Using the definition of S(n) as given in Equation 3.7,
Equation 3.10 can be written as

$$ R(i) = \sum_{n=0}^{N-1-i} S(n) S(n+i) \tag{3.11} $$

For cases in which i = 1, 2, ..., P, Equation 3.11 will be
defined as the short-term autocorrelation of the signal S(n).
Substituting Equation 3.11 into Equation 3.9 yields

$$ \sum_{k=1}^{P} a_k R(i-k) = R(i), \quad i = 1, 2, \ldots, P \tag{3.12} $$

Once the short-term autocorrelation has been computed,
Equation 3.12 represents P linear equations that can be solved
simultaneously for each $a_k$. Linear algebra techniques exist
for efficiently solving Equation 3.12, but a recursive solution
has been developed by Levinson that provides even greater
computational efficiency (Ref 12:129-148).

Levinson's Algorithm

Levinson's algorithm provides a recursive solution to
Equation 3.12 that is both simple and efficient to implement.
To simplify the notation for recursive computation, the
following quantities are defined:

$A_i^{(P)}$ = the ith predictor coefficient of the Pth order
prediction model

r(n) = normalized short-term autocorrelation coefficient

Using the above definitions, Equation 3.12 can now be written
as

$$ r(j) = \sum_{i=1}^{P} A_i^{(P)} r(i-j), \quad j = 1, 2, \ldots, P \tag{3.13} $$


To  start  Levinson's  algorithm,  define  a new  quantity,  Kq , as 


K 


(0)  = r Cl) 
o r (0) 

(P) 


and  solve  recursively  for  J using 

K (P)[r(0)  -Z1  K.CP_1)  r(P-i)]  = r(P+l)  - Z K.  .r(i) 
i=0  1 i=l  1-1 


K. 

l 


(P)  « K (P-1)  . Y (P)  K(P-1) 

l-l  o Vi  , i = 1,  2,  ...  P 


(3.14) 


(3.15) 


(3.16) 


Having  calculated  K,  v* 1 , A.  '■*  can  be  calculated  from  the 


(P)  A (P+1) 

l ’ i 

following  equations;  define  A ^ = 1 


ACp+1)  r rrm  - * v (?) 


fP+11  ' E K.  r(P+l-i)]  = r(P-l)  - IA.mr(Pd-i)  (3.17) 

1 } i = 0 3 i-0  J 


$$A_i^{(P+1)} = A_i^{(P)} - K_i^{(P)} A_{P+1}^{(P+1)}, \qquad i = 0, 1, \ldots, P \qquad (3.18)$$

Two vector quantities are generated as a result of the recursive
computations, $A^{(P)}$ and $K^{(P)}$. $A^{(P)}$ is the vector of
predictor coefficients for a $P$th order model. $K^{(P)}$ can be
interpreted as a vector of reflection coefficients; each $K_i^{(P)}$
is analogous to the reflection coefficients of a P-section
transmission line. As mentioned earlier, the autocorrelation
method allows the model to be checked for stability prior to
implementation. In transmission line theory, if any reflection
coefficient is greater than 1 in magnitude, the circuit is unstable.
Likewise, if any $K_i^{(P)}$ is greater than 1 in magnitude, the model
is unstable. In addition to producing the prediction coefficients
and reflection coefficients, the algorithm also generates the
minimum total squared error for the model. Define $E_0 = 1$ and
solve recursively for $E_{P+1}$ as shown below.

$$E_{P+1} = E_P + A_{P+1}^{(P+1)} \left[ r(P+1) - \sum_{i=0}^{P} K_i^{(P)}\, r(i) \right] \qquad (3.19)$$


Levinson's algorithm not only provides the prediction
coefficients, reflection coefficients, and total squared error for
a Pth order model, but also, since the algorithm is recursive,
provides the same quantities for all models of order less than P.
A flow chart for implementation of Levinson's algorithm is
illustrated in Figure 3.
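
For readers who want to experiment, the following is a minimal Python sketch of the recursion in its common Levinson-Durbin form. It is an illustration under stated assumptions, not a transcription of Equations 3.13 through 3.19: the function name `levinson_durbin`, the use of NumPy, the indexing, and the sign convention (s(n) predicted as the sum of a_i s(n-i)) are choices made for this sketch.

```python
import numpy as np


def levinson_durbin(r, order):
    """Textbook Levinson-Durbin recursion (hypothetical helper, not the thesis code).

    r     : autocorrelation values r[0], ..., r[order]
    order : prediction order P
    Returns (a, k, err):
      a   : predictor coefficients a_1..a_P for s_hat(n) = sum_i a_i s(n-i)
      k   : reflection coefficients k_1..k_P
      err : err[m] is the minimum total squared error of the order-m model,
            with err[0] = r[0]
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)        # a[0] is unused padding
    k = np.zeros(order + 1)
    err = np.zeros(order + 1)
    err[0] = r[0]
    for m in range(1, order + 1):
        # Part of r[m] not explained by the order-(m-1) predictor
        acc = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k[m] = acc / err[m - 1]
        a_prev = a.copy()
        a[m] = k[m]
        for i in range(1, m):
            a[i] = a_prev[i] - k[m] * a_prev[m - i]
        err[m] = err[m - 1] * (1.0 - k[m] ** 2)
    return a[1:], k[1:], err
```

Because the order-m quantities are produced on the way to order P, `err[m]` holds the total squared error of every lower-order model, and a reflection coefficient whose magnitude reaches 1 would flag an unstable model.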




IV.  Distance  Measures 


This  chapter  describes  four  objective  measures  of  speech 
intelligibility  which  are  based  on  a vocal  tract  model  created 
by  linear  prediction.  The  term  distance  measure,  as  used  here, 
indicates  the  relative  distance  between  some  aspect  of  a base- 
line speech  signal  and  a distorted  version  of  that  same  speech 
signal.  For  a distance  measure  to  be  valuable,  it  should  be 
highly  correlated  with  subjective  scoring  results.  The  four 
distance  measures  described  here  were  all  tested  against  the 
subjective  scores  and  Articulation  Index  results  of  the  data 
base.  The  results  are  contained  in  Chapter  V. 

Vocal  Tract  Analysis  Model 

In  Chapter  III  the  error  between  a predicted  speech  signal 
and  the  actual  value  was  defined  as 

$$e(n) = S(n) - \sum_{i=1}^{P} a_i S(n-i) \qquad (4.1)$$

By defining $a_0 = 1$, the above equation can be written in
Z-transform notation as

$$E(Z) = S(Z)\, H(Z) \qquad (4.2)$$

where

$$H(Z) = 1 - \sum_{i=1}^{P} a_i Z^{-i}$$

$H(Z)$ is defined as the vocal tract analysis model and can be
interpreted as an all-zero filter of the $P$th order.


Fant has developed a very detailed model of the human vocal
tract  which  is  described  as  a time  varying  all  pole  filter  (Ref 
11:5-8).  The  filter  is  time  varying  since  it  must  model  the 
changes  in  the  vocal  tract  which  are  made  to  produce  different 
sounds.  Fant  has  shown,  however,  that  the  vocal  tract  and, 
therefore,  the  model  filter  pole  locations  remain  stationary 
for  a period  of  15-20  milliseconds  during  speech  production. 
Thus  H(z),  the  analysis  model,  can  be  interpreted  as  the  in- 
verse of  the  vocal  tract  model  described  by  Fant,  and  must  be 
updated  every  15-20  milliseconds.  The  updating  of  the  analysis 
model  is  required  in  order  to  account  for  the  changes  in  the 
time  domain  characteristics  of  speech  caused  by  changes  in  the 
vocal  tract. 

As described in Chapter II, the digitized data base was
sectioned into 128-point rectangular windows for detection of
the baseline words. The 128-point sections represent a 16
millisecond interval of speech and, therefore, a stationary
analysis model can be developed for each section. Markel and
Gray have shown that to preserve the spectral properties of
speech when using linear prediction analysis, a tapered window
should be applied to each section of speech (Ref 11:157). A
Hamming window of the form shown in Equation 4.3 was used.

$$W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \qquad (4.3)$$
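
As a small illustration, the sectioning and tapering described above might be sketched in Python as follows. The helper names `hamming_window` and `windowed_segments` are hypothetical, and the sketch assumes consecutive, non-overlapping 128-point sections; 128 points spanning 16 milliseconds corresponds to an 8 kHz sampling rate.

```python
import numpy as np


def hamming_window(n_points=128):
    """Hamming window W(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), n = 0..N-1."""
    n = np.arange(n_points)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_points - 1))


def windowed_segments(samples, n_points=128):
    """Split samples into consecutive sections and taper each one.

    Assumption of this sketch: sections do not overlap and any trailing
    samples that do not fill a whole section are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    n_seg = len(samples) // n_points
    sections = samples[:n_seg * n_points].reshape(n_seg, n_points)
    return sections * hamming_window(n_points)
```

Each row of the returned array is one tapered section, ready for autocorrelation and linear prediction analysis.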

Analyzing an entire word required the development of a
collection of analysis models, each representing a separate
16 millisecond segment of the word. The collection of analysis
models describing the characteristics of the word formed the
basis for all four distance measures.


Distance  Measure  1 

Distance Measure 1, DM1, is based on the ratio of the total
squared error (TSE) produced by passing a baseline word through
its analysis model and the TSE produced by passing a distorted
version of the word through the same model. Figure 4 illustrates
the block diagram description of DM1. S(n) is a 128-point
segment of speech and S'(n) is the same speech segment
corrupted by some type of distortion. For each new segment of
speech, the baseline speech signal is analyzed to determine the
linear prediction coefficients that define the analysis model.
DM1 is calculated for each 128-point section of the word and
averaged over all word sections to produce an average distance
measure for the word. DM1 is defined in Equation 4.4.


$$\mathrm{DM1} = \frac{1}{NS} \sum_{k=1}^{NS} \left[ \frac{\sum_{n=0}^{128} [e_k(n)]^2}{\sum_{n=0}^{128} [e'_k(n)]^2} \right]^{1/2} \qquad (4.4)$$

where

NS = the number of 128-point speech segments in the word
being analyzed

' denotes a quantity associated with the noise corrupted
word
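
The computation in Equation 4.4 can be sketched in Python as shown below. This is a sketch under assumptions, not the program used in this study: the helper names are hypothetical, the prediction order of 10 is an arbitrary default, the predictor coefficients are obtained by solving the autocorrelation normal equations directly, and samples before the start of a section are treated as zero.

```python
import numpy as np


def lpc_coefficients(segment, order):
    """Predictor coefficients a_1..a_P from the autocorrelation normal equations."""
    x = np.asarray(segment, dtype=float)
    r = np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def residual(segment, a):
    """Prediction error e(n) = s(n) - sum_i a_i s(n-i) (the all-zero analysis filter)."""
    x = np.asarray(segment, dtype=float)
    h = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, h)[:len(x)]


def dm1(baseline_segments, distorted_segments, order=10):
    """Average over sections of sqrt(TSE_baseline / TSE_distorted).

    Both signals pass through the model derived from the BASELINE section.
    A silent section (zero TSE) would need guarding; this sketch omits that.
    """
    ratios = []
    for s, s_dist in zip(baseline_segments, distorted_segments):
        a = lpc_coefficients(s, order)
        tse = np.sum(residual(s, a) ** 2)
        tse_dist = np.sum(residual(s_dist, a) ** 2)
        ratios.append(np.sqrt(tse / tse_dist))
    return float(np.mean(ratios))
```

Passing the same sections as both arguments returns 1.0, and adding noise to the second argument drives the value toward zero.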

The maximum value of DM1 is 1.0 and can occur only when
S(n) and S'(n) are identical. As the distortion present in
S'(n) is increased, the TSE produced by S'(n) should increase
and, therefore, decrease the value of DM1.


Distance  Measure  2 

Distance  Measure  2,  DM2,  is  identical  to  DM1  with  the 
exception  that  the  ratio  of  the  total  squared  errors  of  the 
two  speech  sections  is  normalized,  based  on  the  sum  of  the 
squares  of  the  speech  samples.  DM2  is  defined  in  Equation  4.5. 


$$\mathrm{DM2} = \frac{1}{NS} \sum_{k=1}^{NS} \frac{R0'(k)}{R0(k)} \cdot \frac{\sum_{n=0}^{128} [e_k(n)]^2}{\sum_{n=0}^{128} [e'_k(n)]^2} \qquad (4.5)$$

where

R0(k) is the sum of the squared sample values in data
section k of the master word

R0'(k) is the sum of the squared sample values in data
section k of the distorted word
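
A short Python sketch of this normalization follows. It assumes Equation 4.5 is the per-section product of R0'(k)/R0(k) and the ratio of total squared errors, with no square root, and it takes the prediction residuals as inputs that were computed beforehand; the function name `dm2` and its argument layout are hypothetical.

```python
import numpy as np


def dm2(baseline_segments, distorted_segments,
        baseline_residuals, distorted_residuals):
    """Energy-normalized error ratio averaged over sections (assumed form of Eq. 4.5)."""
    terms = []
    for s, s_dist, e, e_dist in zip(baseline_segments, distorted_segments,
                                    baseline_residuals, distorted_residuals):
        r0 = np.sum(np.square(s))             # R0(k): energy of the master section
        r0_dist = np.sum(np.square(s_dist))   # R0'(k): energy of the distorted section
        tse = np.sum(np.square(e))            # total squared error, baseline
        tse_dist = np.sum(np.square(e_dist))  # total squared error, distorted
        terms.append((r0_dist / r0) * (tse / tse_dist))
    return float(np.mean(terms))
```

Under this reading, a pure gain change in the distorted signal scales R0'(k) and its total squared error by the same factor, so the measure is unaffected by level differences alone.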


Distance Measure 3

Distance Measure 3, DM3, is based on an intelligibility
measure developed by Hartman (Ref 8). A block diagram
illustrating the signal processing for DM3 is shown in Figure 5.
Hartman's measure is based on the ratio of total squared error
between two data sequences but, as opposed to DM1 and DM2, the
measure is calculated from the short-term autocorrelation and
LPC coefficients of the two signals, instead of creating an
analysis model and processing both signals. DM3 is based on
four quantities: E, E', D, and D', which are defined below.

E   Minimum total squared error calculated from Levinson's
    algorithm by analyzing an undistorted section of speech

E'  Minimum total squared error calculated from Levinson's
    algorithm by analyzing a distorted section of speech

D   Total squared error created by comparing an undistorted
    section of speech with the predictor coefficients
    calculated from the corresponding section of the distorted
    speech

D'  Total squared error created by comparing a distorted
    section of speech with the predictor coefficients
    calculated from the corresponding section of undistorted
    speech

All four quantities can be expressed in matrix notation by the
following equations.

$$E = A^T R A \qquad (4.6)$$

$$E' = A'^T R' A' \qquad (4.7)$$

$$D = A'^T R A' \qquad (4.8)$$

$$D' = A^T R' A \qquad (4.9)$$

where

primed quantities indicate the value associated with
distorted speech

T denotes the transpose of a vector

A is a P+1 element vector of predictor coefficients,
$A^T = (1, -a_1, -a_2, \ldots, -a_P)$

R is a P+1 by P+1 matrix of short-term autocorrelation
values, $R_{ij} = r(|i-j|)$

As stated earlier, E and E' are a direct result of Levinson's
algorithm, but D and D' must be calculated using
Equations 4.8 and 4.9. However, computational time for D and D'
can be saved by exploiting the symmetrical properties of the
autocorrelation matrices. The R matrix is structured such that
the elements of each diagonal are equal so that D and D' can be
calculated from Equations 4.10 and 4.11.


$$D = \sum_{i=0}^{P} g'(i)\, r(i) \qquad (4.10)$$

$$D' = \sum_{i=0}^{P} g(i)\, r'(i) \qquad (4.11)$$

where

$$g(i) = \sum_{k=0}^{P-i} a_k a_{k+i}, \quad i = 1, 2, \ldots, P; \qquad g(0) = \sum_{k=0}^{P} a_k^2$$

$$g'(i) = \sum_{k=0}^{P-i} a'_k a'_{k+i}, \quad i = 1, 2, \ldots, P; \qquad g'(0) = \sum_{k=0}^{P} {a'_k}^2$$

$a_k$ and $a'_k$ are the $k$th elements of the A and A' vectors of
predictor coefficients. Two distance measures are defined
based on the ratios D'/E' and D/E.


$$E1 = \ln (D'/E') \qquad (4.12)$$

$$E2 = \ln (D/E) \qquad (4.13)$$


where ln denotes the natural logarithm.

To facilitate the comparison of E1 and E2 to the subjective
scores, thresholds are established for E1 and E2 and the
thresholded quantities are scaled to a range of 0 to 1. A
value of 1 indicates that the speech is completely
understandable, while a value of 0 indicates complete
misunderstanding. Hartman established thresholds for determining
whether a segment of speech is completely understood or
completely misunderstood based on the work of Flanagan (Ref 7)
and Sambur and Jayant (Ref 12). Initially, the thresholds were
set so that a value for E1 or E2 greater than 2.46 indicated the
speech was totally unintelligible. At the other end of the
scale, a value less than 0.82 indicated that the speech was
completely intelligible. An intelligibility metric was,
therefore, defined by

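$$\mathrm{METRIC1} = \begin{cases} 1, & \text{if } E1 < 0.82 \quad \text{(completely intelligible)} \\ 0, & \text{if } E1 > 2.46 \quad \text{(completely unintelligible)} \\ \dfrac{2.46 - E1}{2.46 - 0.82}, & \text{otherwise} \end{cases} \qquad (4.14)$$

$$\mathrm{METRIC2} = \begin{cases} 1, & \text{if } E2 < 0.82 \quad \text{(completely intelligible)} \\ 0, & \text{if } E2 > 2.46 \quad \text{(completely unintelligible)} \\ \dfrac{2.46 - E2}{2.46 - 0.82}, & \text{otherwise} \end{cases} \qquad (4.15)$$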
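
The chain from the quadratic forms of Equations 4.6 through 4.9 to the fixed-threshold metric can be sketched in Python as follows. All helper names are hypothetical, the prediction order of 10 is an arbitrary default, and the coefficients are found by solving the normal equations directly instead of by Levinson's algorithm. Expanding a quadratic form such as A'^T R A' term by term counts every nonzero lag twice, so `quadratic_form_via_g` applies a factor of 2 to lags 1 through P; that factor is this sketch's own bookkeeping and may be absorbed differently in the printed Equations 4.10 and 4.11.

```python
import numpy as np


def autocorr(x, order):
    """Short-term autocorrelation r(0..order) of one section."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(order + 1)])


def toeplitz_matrix(r):
    """Symmetric matrix with entries r(|i - j|)."""
    n = len(r)
    return np.array([[r[abs(i - j)] for j in range(n)] for i in range(n)])


def coefficient_vector(r):
    """A = (1, -a_1, ..., -a_P), with a_i from the normal equations."""
    a = np.linalg.solve(toeplitz_matrix(r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))


def quadratic_form_via_g(A, r):
    """A^T R A computed from the autocorrelation g(i) of the coefficient vector."""
    g = [np.dot(A[:len(A) - i], A[i:]) for i in range(len(A))]
    return g[0] * r[0] + 2.0 * sum(g[i] * r[i] for i in range(1, len(A)))


def log_ratios(s, s_dist, order=10):
    """Return (E1, E2) = (ln(D'/E'), ln(D/E)) for one pair of sections."""
    r, r_d = autocorr(s, order), autocorr(s_dist, order)
    A, A_d = coefficient_vector(r), coefficient_vector(r_d)
    R, R_d = toeplitz_matrix(r), toeplitz_matrix(r_d)
    E, E_d = A @ R @ A, A_d @ R_d @ A_d
    D, D_d = A_d @ R @ A_d, A @ R_d @ A
    return float(np.log(D_d / E_d)), float(np.log(D / E))


def fixed_threshold_metric(e_value, low=0.82, high=2.46):
    """Piecewise-linear map of E1 or E2 onto the range 0 to 1."""
    if e_value < low:
        return 1.0
    if e_value > high:
        return 0.0
    return (high - e_value) / (high - low)
```

Since E is the smallest value the quadratic form can take for a vector whose first element is 1, D can never fall below E, so both log ratios are non-negative and equal zero for identical sections.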


Hartman found that the establishment of fixed thresholds
proved unsatisfactory for predicting intelligibility and he,
therefore, suggested that the thresholds be adjusted depending
on the character of the noise present. In determining the
threshold adjustments, eight 128-point segments of noise were
taken from the data tapes and LPC analysis performed on the
samples. For the noise segments with the largest and smallest
values of the sum of the squares of predictor coefficients,
E1N and E2N were calculated as defined below


Calculated for the noise segment with the largest sum of squares:

$$E1NH = \ln (D'/E') \qquad (4.16)$$

$$E2NH = \ln (D/E) \qquad (4.17)$$

Calculated for the noise segment with the smallest sum of squares:

$$E1NL = \ln (D'/E') \qquad (4.18)$$

$$E2NL = \ln (D/E) \qquad (4.19)$$

where primed quantities in these equations indicate values
associated with noise samples, and unprimed quantities indicate
values associated with samples of the baseline word. E1NH and
E1NL are averaged to produce E1N, and E2NH and E2NL are
averaged to produce E2N. E1N and E2N are calculated for each
segment of speech to be analyzed and the intelligibility
thresholds are adjusted according to the following equations.


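$$\left. \begin{aligned} T1E1 &= 0.82 \\ T2E1 &= 2.46 \end{aligned} \right\} \quad \text{if } E1N < 2.46 \qquad (4.20)$$

$$\left. \begin{aligned} T1E1 &= 0.82 + 0.82\,(E1N - 2.46) \\ T2E1 &= 2.46 + 0.82\,(E1N - 2.46) \end{aligned} \right\} \quad \text{if } E1N > 2.46 \qquad (4.21)$$

$$\left. \begin{aligned} T1E2 &= 0.82 \\ T2E2 &= 2.46 \end{aligned} \right\} \quad \text{if } E2N < 2.46 \qquad (4.22)$$

$$\left. \begin{aligned} T1E2 &= 0.82 + 0.82\,(E2N - 2.46) \\ T2E2 &= 2.46 + 0.82\,(E2N - 2.46) \end{aligned} \right\} \quad \text{if } E2N > 2.46 \qquad (4.23)$$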


With the adjustment of the thresholds based on the character of
the noise, the two metrics as defined in Equations 4.14 and
4.15 were modified as shown next.

$$\mathrm{METRIC1} = \begin{cases} 1, & \text{if } E1 < T1E1 \\ 0, & \text{if } E1 > T2E1 \\ \dfrac{T2E1 - E1}{T2E1 - T1E1}, & \text{otherwise} \end{cases} \qquad (4.24)$$

$$\mathrm{METRIC2} = \begin{cases} 1, & \text{if } E2 < T1E2 \\ 0, & \text{if } E2 > T2E2 \\ \dfrac{T2E2 - E2}{T2E2 - T1E2}, & \text{otherwise} \end{cases} \qquad (4.25)$$


Modified thresholds were established for each segment of
speech to be analyzed and METRIC1 and METRIC2 calculated.
METRIC1 and METRIC2 were averaged over all segments of the
word being analyzed to produce $\overline{\mathrm{METRIC1}}$ and
$\overline{\mathrm{METRIC2}}$. $\overline{\mathrm{METRIC1}}$ and
$\overline{\mathrm{METRIC2}}$ are then averaged to produce DM3.


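$$\overline{\mathrm{METRIC1}} = \frac{1}{NS} \sum_{k=1}^{NS} \mathrm{METRIC1}(k) \qquad (4.26)$$

$$\overline{\mathrm{METRIC2}} = \frac{1}{NS} \sum_{k=1}^{NS} \mathrm{METRIC2}(k) \qquad (4.27)$$

$$\mathrm{DM3} = \frac{\overline{\mathrm{METRIC1}} + \overline{\mathrm{METRIC2}}}{2} \qquad (4.28)$$

where

NS is the number of 128-point speech segments contained
in the word being analyzed.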
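
The threshold adjustment and averaging steps can be summarized in a short Python sketch. The function names are hypothetical, and the handling of a noise value exactly equal to 2.46 (treated here as no adjustment) is an assumption, since Equations 4.20 through 4.23 leave that case open.

```python
def adjusted_thresholds(e_noise, low=0.82, high=2.46, slope=0.82):
    """Shift both thresholds by slope * (e_noise - high) when e_noise exceeds high."""
    if e_noise > high:
        shift = slope * (e_noise - high)
        return low + shift, high + shift
    return low, high


def metric(e_value, t1, t2):
    """1 below the lower threshold, 0 above the upper one, linear in between."""
    if e_value < t1:
        return 1.0
    if e_value > t2:
        return 0.0
    return (t2 - e_value) / (t2 - t1)


def dm3(e1, e2, e1n, e2n):
    """Average the per-segment metrics over the word, then average the two means.

    e1, e2   : per-segment values of E1 and E2
    e1n, e2n : per-segment noise values E1N and E2N used to adjust thresholds
    """
    m1 = [metric(e, *adjusted_thresholds(n)) for e, n in zip(e1, e1n)]
    m2 = [metric(e, *adjusted_thresholds(n)) for e, n in zip(e2, e2n)]
    return 0.5 * (sum(m1) / len(m1) + sum(m2) / len(m2))
```

For example, a noise value of 3.46 moves the thresholds from (0.82, 2.46) to about (1.64, 3.28), so a segment must show a larger log ratio before it is scored as unintelligible.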


Distance  Measure  4 

Distance Measure 4, DM4, is identical to DM3 except that
$\overline{\mathrm{METRIC1}}$ and $\overline{\mathrm{METRIC2}}$ are a
weighted average based on the distribution of signal power in
the word being analyzed. The average power, $\overline{R0}$, in
the word being analyzed is calculated by computing the average
squared sample value of the undistorted word. Values obtained
for METRIC1 and METRIC2 when the average signal power in the
segment of speech being analyzed was greater than
$\overline{R0}/2$ were averaged to produce M1H and M2H,
respectively. Conversely, values for METRIC1 and METRIC2 for
segments with an average signal power less than $\overline{R0}/2$
were averaged to yield M1L and M2L. DM4 was then defined by


$$\mathrm{DM4} = \frac{\overline{M1} + \overline{M2}}{2} \qquad (4.29)$$

where

$$\overline{M1} = M1H + M1L$$

$$\overline{M2} = M2H + M2L$$


In general, only about one-third of the speech segments
will have an average power less than $\overline{R0}/2$. Therefore,
the weighted average used to calculate DM4 has the effect of
emphasizing the intelligibility of low power segments more
than high power segments.


V.  Results


As  mentioned  in  Chapter  II,  the  data  base  used  in  this 
study  had  been  subjectively  scored  by  a listener  group  of  ten 
people.  Each  listener  was  given  a score  sheet  containing  each 
pair  of  rhyming  words  from  the  master  lists.  As  the  listener 
heard  each  word  of  the  data  (noise  corrupted)  list,  he  marked 
which  word  of  the  master  pair  he  perceived  was  said.  A sub- 
jective score  was  developed  for  each  data  list  by  calculating 
the  percent  of  right  answers  on  each  listener's  score  sheet. 
The  average  subjective  score  for  the  listener  group  was  used 
as the subjective measure of speech intelligibility. The
results of the subjective scoring for each word list are shown
in Table II.


TABLE II

Subjective Scores

Jamming Level                  1    2    3    4    5    6    7
Subjective Score (percent)    90   92   93   84   84   86   79


The  subjective  measure  for  each  word  list  was  the  standard 
by  which  the  effectiveness  of  the  LPC  based  measures  and  the 
Articulation  Index  were  judged. 

Articulation  Index 

The Articulation Index (AI) was calculated using the
one-third octave band method (Ref 2:11-15). This method differs
slightly from the standard 20-band method described in Chapter
I, but can be more easily calculated in an automated manner
using existing audio test equipment. Table III illustrates
the measurements and calculations involved in the one-third
octave method.

A Bruel  and  Kjaer  Digital  Frequency  Analyzer  Type  2131  was 
used  to  measure  the  signal  power  in  each  of  the  one-third 
octave  bands,  and  a Sony  Model  TC-850  tape  recorder  was  used 
for  analog  data  input  to  the  analyzer.  A Hewlett-Packard  9845A 
mini-computer  was  used  to  control  the  frequency  analyzer  and  to 
calculate  the  Articulation  Index.  The  steps  involved  in  com- 
puting the  AI  are  detailed  below. 

Step  1.  A baseline  analog  tape  was  played  into  the  fre- 
quency  analyzer.  The  analyzer  computed  the  average  power  in 
each  of  the  one-third  octave  bands  over  a period  of  128  seconds. 

Step  2.  At  the  end  of  128  seconds,  the  mini-computer 
sampled  the  average  power  figures  for  each  of  the  frequency 
bands  and  stored  the  results  (Column  2 of  Table  III). 

Step 3. A noise corrupted data tape was played through the
frequency analyzer and the average power calculated in each
band over a 128 second interval.

Step 4. The mini-computer sampled the average power figures
for the noisy tape and stored the results (Column 3 of Table III).

Step  5.  The  average  noise  power  in  each  of  the  bands  was 
calculated  by  subtracting  the  baseline  signal  power  from  the 
signal  plus  noise  power  (Column  4 of  Table  III). 

Step 6. The signal-to-noise ratio (SNR) was calculated for
each frequency band (Column 5 of Table III).

TABLE III

One-Third Octave Band Method for Computing Articulation Index

[Table body not recoverable from the source; only the headings Column 1 through Column 7 survive.]

Step  7.  The  SNR  value  for  each  band  was  adjusted  according 
to  Equation  5.1. 

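$$\mathrm{SNR} = \begin{cases} 30, & \text{if } \mathrm{SNR} > 30 \\ 0, & \text{if } \mathrm{SNR} < 0 \\ \mathrm{SNR}, & \text{otherwise (no adjustment is made)} \end{cases} \qquad (5.1)$$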

Step  8.  The  adjusted  SNR  for  each  band  was  multiplied  by 
the  weighting  factor  shown  in  Column  6 and  the  values  for  each 
band  summed  to  produce  the  Articulation  Index. 
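
Steps 5 through 8 can be sketched in Python as shown below. This is a sketch under assumptions: band powers are taken to be in linear units so that they can be subtracted, the SNR is expressed in decibels, the function name `articulation_index` is hypothetical, and the band weights must be supplied by the caller because the weighting factors of Table III are not reproduced here. The floor applied to the noise estimate is also an assumption of this sketch for bands where the measured signal-plus-noise power does not exceed the baseline power.

```python
import numpy as np


def articulation_index(signal_power, signal_plus_noise_power, weights):
    """Weighted sum of clipped per-band SNR values.

    signal_power            : baseline power per band (linear units, assumed)
    signal_plus_noise_power : noisy-tape power per band (linear units, assumed)
    weights                 : per-band weighting factors (caller supplied)
    """
    s = np.asarray(signal_power, dtype=float)
    spn = np.asarray(signal_plus_noise_power, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Step 5: noise power = (signal + noise) - signal, floored to stay positive
    noise = np.maximum(spn - s, 1e-12)
    # Step 6: signal-to-noise ratio per band, in dB
    snr_db = 10.0 * np.log10(s / noise)
    # Step 7: limit each band to the range 0 to 30 (Equation 5.1)
    snr_db = np.clip(snr_db, 0.0, 30.0)
    # Step 8: weight and sum over bands
    return float(np.sum(w * snr_db))
```

With two bands weighted 1/60 each, one band at 30 dB or better and one at 0 dB give an index of 0.5; weights that sum to 1/30 keep the result between 0 and 1.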


TABLE IV

Articulation Index

Jamming Level           1     2     3     4     5     6     7
Articulation Index     .49   .50   .52   .44   .37   .37   .36


The Articulation Index was calculated for each of the noise
corrupted tapes. A scatter plot comparing the AI to the
subjective scores is shown in Figure 6. The correlation between
the AI and the subjective scores was computed using Equation 5.2.

$$\text{correlation} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\left[ \sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2 \right]^{1/2}} \qquad (5.2)$$

where

$X_i$ denotes the $i$th value of the Articulation Index

$\bar{X}$ is the mean value of the Articulation Index

$Y_i$ denotes the $i$th subjective score

$\bar{Y}$ is the mean value of the subjective scores
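
As an illustration of Equation 5.2, the Python sketch below evaluates the correlation for the seven jamming levels. The values are transcribed from Tables II and IV as they appear in this copy, so the result is only as reliable as that transcription and is not presented as the figure reported in this study; the function name `correlation` is hypothetical.

```python
import numpy as np


def correlation(x, y):
    """Sample correlation coefficient in the form of Equation 5.2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


# Values as transcribed from Tables II and IV (jamming levels 1 through 7)
articulation_index_values = [0.49, 0.50, 0.52, 0.44, 0.37, 0.37, 0.36]
subjective_scores = [90, 92, 93, 84, 84, 86, 79]

print(round(correlation(articulation_index_values, subjective_scores), 3))
```

The same function applies unchanged to any of the LPC based measures once their per-list values are available.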


LPC  Measures 

Computer  programs  were  developed  for  computing  each  of  the 
LPC  based  distance  measures.  A Xerox  Sigma  7 computer  was  used 
for  exercising  each  of  the  measures  on  the  data  base.  Figures 
7 through  10  are  scatter  plots  comparing  each  of  the  measures 
to  the  subjective  scores.  As  with  the  Articulation  Index,  the 
correlation  between  the  LPC  measures  and  the  subjective  scores 
is  shown. 


TABLE  V 

LPC  Distance  Measures 


Fig 6.  Articulation Index

Fig 7.  Distance Measure 1

Fig 8.  Distance Measure 2

Fig 9.  Distance Measure 3

Fig 10.  Distance Measure 4


VI.  Conclusions  and  Recommendations 



Conclusions 

No  LPC  based  distance  measure  produced  a correlation  with 
the  subjective  scores  that  can  be  considered  significant  for 
a valid  distance  measure.  As  mentioned  in  Chapter  II,  two 
reasons  may  explain  the  failure  of  the  LPC  based  measures. 

First,  inadequate  tape  alignment  procedures  may  have  caused 
the  noise  corrupted  word  samples  to  have  been  shifted  relative 
to  the  corresponding  baseline  word  samples.  However,  as 
stated  before,  the  intent  of  this  research  was  to  evaluate  the 
use  of  LPC  based  measures  under  conditions  similar  to  those 
under  which  the  Articulation  Index  is  presently  used.  The 
use  of  timing  marks  on  the  analog  data  tapes  can  be  considered 
a reasonable  and  realistic  way  of  providing  tape  synchroniza- 
tion under  normal  data  collection  conditions.  The  failure  of 
LPC  measures  solely  because  of  tape  alignment  does  not  invali- 
date the  findings  of  this  report,  but  rather  reinforces  the 
claim  that  LPC  measures  require  synchronization  in  excess  of 
what  can  be  realistically  provided  under  field  use  outside  the 
laboratory. Hartman reported that software implementation of
his LPC based measures resulted in 70% of the computer time and
85% of the manpower requirements being directly attributable
to the data alignment procedures. Unless LPC based measures
can be proven to be considerably more effective than the
Articulation Index, the extensive overhead associated with data
alignment makes the LPC measures impractical for field use.

The  second  reason  the  LPC  measures  may  have  failed  is 
that  LPC  measures  may  be  incapable  of  predicting  speech  intelli- 
gibility in  the  presence  of  the  distortion  types  used  here. 
Computation  of  the  Articulation  Index  did  provide  some  in- 
sight into  the  differences  between  the  baseline  speech  signal 
spectrum  and  the  noise  corrupted  data  that  may  indicate  the 
types  of  distortion  present.  The  digital  frequency  analyzer 
indicated  that  in  the  frequency  range  0-355  Hz,  the  baseline 
speech  signal  contained  more  power  than  the  baseline  signal 
plus noise as output from the receiver. Two possible
explanations may account for this inconsistency in power spectra.
First, spread spectrum communications involve spreading a
relatively narrowband signal (4 kHz) into a relatively
wideband signal (30 MHz). The spreading of the signal, in
addition to any non-linear processing within the transmitter/
receiver pair, could cause a frequency translation in the
speech signal. Second, the use of high pass filters, such as
in pre-emphasis, would have the effect of decreasing the power
in the lower frequencies of the transmitted speech signal.

Since  the  baseline  speech  signal  represents  speech  before  it 
enters  the  transmitter,  it  is  conceivable  that  the  low  fre- 
quency spectrum  of  the  baseline  signal  would  be  greater  than 
the  received  version  of  the  baseline  signal  plus  noise. 

Hartman showed that, like the AI, LPC intelligibility measures
compare the frequency spectrum of two signals (Ref 8:29).
This indicates that if there is frequency distortion present
in the signal to be measured for intelligibility, the
accuracy of LPC measures is in doubt. It is interesting to
note, however, that the AI apparently predicts intelligibility
in the presence of frequency distortion of the kind evident
in this experiment.

The  Articulation  Index  proved  to  be  significantly  more 
efficient  to  compute  in  terms  of  manhours  and  equipment  as 
compared  to  the  LPC  based  measures.  The  inefficiency  of  LPC 
measures  is  primarily  due  to  the  synchronization  requirements 
and  the  present  lack  of  commercially  available  LPC  hardware. 
Additionally,  the  AI  provided  much  more  accurate  results. 

Since  the  LPC  based  measures,  like  the  AI,  are  based  on  the 
comparison  of  the  frequency  spectrum  of  two  signals,  a per- 
formance advantage  of  LPC  measures  over  the  AI  is  doubtful. 
Interestingly,  both  studies  of  LPC  measures  mentioned  in  this 
report,  References  4 and  8,  failed  to  compare  LPC  measures 
with  the  Articulation  Index.  Unless  a clear  performance  advan- 
tage of  LPC  based  measures  over  the  AI  can  be  proved,  the 
continuation  of  research  measuring  speech  intelligibility 
using  LPC  measures  is  questionable.  Once  the  superiority  of 
LPC  measures  is  proven,  work  must  be  done  on  developing  an 
efficient  LPC  based  system  which  can  function  with  limited 
overhead  and  under  field  conditions  such  as  real-time  testing 
of  voice  communications  channels. 

Recommendations 

As mentioned in the introduction to Chapter I, a real need
exists within the military for an efficient and effective
automated speech intelligibility measure. The most successful
automated measure presently used by the military is the
Articulation Index. Research into the use of LPC for
intelligibility measures must first examine the sensitivity of
LPC calculations to data alignment. If adequate alignment
techniques can be developed and proven usable under field
conditions, the further development of LPC measures is
warranted.

The most pressing need identified during the conduct of
this research is the need for an extensive data base of
testable speech received over an actual communications
channel in the presence of common channel distortions. To
date, the majority of data bases are extremely small and
involve distortions simulated in the laboratory or by
computer, rather than real-world distortions. Once such a data
base can be created, an extremely useful and valuable tool
will be present for judging the relative merits of automated
speech intelligibility measures.